Actuarial Applications of Natural Language Processing Using Transformers

A Case Study for Processing Text Features in an Actuarial Context

Part II – Case Studies on Property Insurance Claim Descriptions - Unsupervised Techniques

By Andreas Troxler, June 2022

In this Part II of the tutorial, you will learn techniques that can be applied in situations with few or no labels. This is very relevant in practice: text data is often available, but labels are missing or sparse!

Let’s get started.

Notebook Overview

This notebook is divided into six parts; they are:

  1. Introduction.
    We begin by explaining the prerequisites. Then we turn to loading and exploring the dataset – ca. 6k records of short property insurance claim descriptions, which we aim to classify by peril type.

  2. Classify by peril type in a supervised setting.
    To warm up, we apply supervised learning techniques you have learned in Part I to the dataset of this Part II.

  3. Zero-shot classification.
    This technique assigns each text sample to one element of a pre-defined list of candidate expressions. This allows classification without any task-specific training and without using the labels. This fully unsupervised approach is useful in situations with no labels.

  4. Unsupervised classification using similarity.
    This technique encodes each input sentence and each candidate expression into an embedding vector. Then, pairwise similarity scores between each input sequence and each candidate expression are calculated. The candidate expression with the highest similarity score is selected. This fully unsupervised approach is useful in situations with no labels.

  5. Unsupervised topic modeling by clustering of document embeddings.
    This approach extracts clusters of similar text samples and proposes verbal representations of these clusters. The labels are not required, but may be used in the process if available. This technique does not require prior knowledge of candidate expressions.

  6. Conclusion

1. Introduction

In this section we discuss the prerequisites and then load and inspect the dataset.

1.1. Prerequisites

Computing Power

This notebook is computationally intensive. We recommend using a platform with GPU support.

We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).

Please note that the results may not be reproducible across platforms and versions.

Local files

Make sure the following files are available in the directory of the notebook:

This notebook will create the following subdirectories:

Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook. We also assume that you have worked through Part I of this tutorial.

In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.

Importing Required Libraries

If you run this notebook on Google Colab, you will need to install the following libraries:

and loaded:

1.2. Loading the Data

The dataset used throughout this tutorial concerns property insurance claims of the Wisconsin Local Government Property Insurance Fund (LGPIF), made available in the open text project of Frees. The Wisconsin LGPIF is an insurance pool managed by the Wisconsin Office of the Insurance Commissioner. This fund provides insurance protection to local governmental institutions such as counties, schools, libraries, airports, etc. It covers property claims on buildings and motor vehicles, and it excludes certain natural and man-made perils such as flood, earthquakes or nuclear accidents.

The data consists of 6’030 records (4’991 in the training set, 1’039 in the test set) which include a claim amount, a short English claim description and a hazard type with 9 different levels: Fire, Lightning, Hail, Wind, WaterW (weather related water claims), WaterNW (other weather claims), Vehicle, Vandalism and Misc (any other).

The training and validation sets are available in separate csv files. We load them into Pandas DataFrames, create a single column containing the label, and finally create a dataset.
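As a minimal sketch of this step (the real file names and column layout may differ), the label column can be created as follows, with a two-row inline stand-in for the csv files:

```python
import io

import pandas as pd

# Toy stand-in for the csv files (the actual files and columns may differ).
csv_text = """Description,Hazard
lightning damage to pump,Lightning
vandalism broken window,Vandalism
"""

df_train = pd.read_csv(io.StringIO(csv_text))

# Create a single integer label column from the hazard type.
hazards = sorted(df_train["Hazard"].unique())
label2id = {name: i for i, name in enumerate(hazards)}
df_train["label"] = df_train["Hazard"].map(label2id)
```

A Hugging Face dataset can then be created from the DataFrame, e.g. via `datasets.Dataset.from_pandas(df_train)`.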

1.3 Exploring the data

The first records of the training dataset look like this:

Let's look at the distribution of peril types in the training and validation set:

Next, we want to see some statistics on the length of the claim descriptions. To this end, we split the texts into words, with blank spaces as separators. The text length averages 5 words and does not seem to vary significantly by peril:
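These statistics can be computed along the following lines (a sketch with a toy two-row DataFrame standing in for the training data):

```python
import pandas as pd

# Toy stand-in for the training data.
df = pd.DataFrame({
    "Description": ["lightning damage to pump", "broken window"],
    "Hazard": ["Lightning", "Vandalism"],
})

# Split on blank spaces and count the words per description.
df["n_words"] = df["Description"].str.split().str.len()

# Average length overall and by peril type.
mean_len = df["n_words"].mean()
by_peril = df.groupby("Hazard")["n_words"].mean()
```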

To get an impression of the most frequent words, we generate a simple word cloud from all claim descriptions. By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.), which are the most common words and do not add much information to the text.

2. Classify by Peril Type in a Supervised Setting

In this section, we will train classifiers to predict the peril type (labels).

We will follow two approaches:

  1. We use a transformer encoder to encode the claim descriptions, and then train a logistic regression classifier to predict the peril type from the encoded descriptions.

  2. We train a transformer encoder with a classifier head directly.

Let's get started.

2.1 Train a Classifier on Encoded Claim Descriptions

We follow the approach presented in Part I of this tutorial.

In this single-language case study, we use the distilbert-base-uncased model. First, we load the model and the tokenizer.

Then we define a function that applies the tokenizer to the column Description of an input batch...

... and we apply this function to the entire dataset using the map function:

Next, we apply the function extract_sequence_encoding (implemented in tutorial_utils.py) which applies the model to a batch and extracts the last hidden state, which is the encoded input text.

We fit a dummy classifier (which always predicts the most frequent class) to the mean-pooled encodings.

We also fit a logistic regression classifier and evaluate its performance on the training and validation split. For ease of use, this functionality is implemented in tutorial_utils.py.
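The two classifiers can be sketched with standard scikit-learn; here, a random array stands in for the mean-pooled encodings produced by the transformer:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the mean-pooled sentence encodings and the peril labels.
X_train = rng.normal(size=(200, 16))
y_train = rng.integers(0, 3, size=200)

# Baseline: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Logistic regression on the encodings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```

In the notebook, the same pattern is applied to the real encodings, and the accuracy of the logistic regression is compared against the dummy baseline.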

This result is encouraging. From the classification report, we see that the perils WaterNW, WaterW and Misc are most difficult to predict.

2.2 Task-specific Training of a Transformer-based Classifier

In this section, we directly train a transformer-based sequence classifier, using the approach described in Part I of this tutorial.

On an AWS EC2 p2.xlarge instance, the run time is about 2 minutes.

We evaluate the model on the test set:

The performance is comparable to that of the logistic regression classifier, with an improved Brier loss and accuracy score. It appears that the model struggles to tell WaterNW apart from WaterW.

3. Zero-shot Classification

There are situations with no or only few labeled data.

Zero-shot classification is well suited to this case: it classifies text sequences in an unsupervised way, without collecting training data and building a model in advance.

The model is presented with a text sequence and a list of expressions, and assigns a probability to each expression.

3.1 Demonstration of the approach

In this section you will learn how to apply zero-shot classification to perform the classification by peril type on the claims data described above.

First, we create a dictionary mapping certain verbal expressions to peril types:

We set up the zero-shot classifier using the pipeline abstraction. By default, the facebook/bart-large-mnli model is used. By specifying device=0, we use GPU support if available.

Then, we feed the claim descriptions of the entire test set, presenting the classifier with the list of possible choices as the second argument.

We use the test set directly, because zero-shot classification requires no training!

On an AWS EC2 p2.xlarge instance, the run time is about 5 minutes.

This returns a list of dicts with the following keys:

We store the predictions in a Pandas DataFrame and evaluate the performance.
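A sketch of this step, with two toy result dicts standing in for actual pipeline output (per the Hugging Face documentation, each dict has the keys sequence, labels and scores, with scores sorted in descending order):

```python
import pandas as pd

# Toy stand-in for the output of the zero-shot pipeline:
# results = classifier(descriptions, candidate_labels)
results = [
    {"sequence": "lightning struck transformer",
     "labels": ["Lightning", "Fire", "Misc"],
     "scores": [0.85, 0.10, 0.05]},
    {"sequence": "broken window",
     "labels": ["Vandalism", "Misc", "Fire"],
     "scores": [0.60, 0.30, 0.10]},
]

# The top-ranked candidate expression is the predicted class.
df_pred = pd.DataFrame({
    "sequence": [r["sequence"] for r in results],
    "prediction": [r["labels"][0] for r in results],
    "score": [r["scores"][0] for r in results],
})
```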

On the test set, we achieve an accuracy of 65.5% (compared to 29.8% of the dummy classifier). Apparently, the classifier struggles to correctly identify the WaterW cases based on the expression “Weather”. Also, it seems that the expression “Misc” may not be the optimal choice, as it produces many false positives.

3.2 Refinement

To improve the performance on "Misc", we introduce the following heuristic: If the probability assigned to the expression “Misc” is highest but with a margin of less than 50 percentage points to the second-most likely expression, we select the latter.
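The heuristic can be expressed as a small helper function (a sketch; refine_misc is a hypothetical name, and labels/scores are assumed to be sorted by descending score, as in the pipeline output):

```python
def refine_misc(labels, scores, margin=0.50):
    """If 'Misc' wins by less than `margin` over the runner-up, pick the runner-up.

    `labels` and `scores` are assumed to be sorted by descending score.
    """
    if labels[0] == "Misc" and scores[0] - scores[1] < margin:
        return labels[1]
    return labels[0]
```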

We export the output to Excel to analyze the prediction errors.

Looking at false predictions in the training set, we observe the following:

Based on these and similar observations, one could refine the approach by adding more candidate expressions, e.g., adding “glass” to hazard type 0 (“Vandalism”), “light pole” and “fence” to hazard type 5 (“Vehicle”), “storm” and “ice” to hazard type 7 (“WaterW”), etc.

However, the computational effort of zero-shot classification scales with the number of candidate expressions times the number of samples, so we don't want to supply too many candidate expressions. Ideally, we would have an approach to extract candidate expressions from the data.

We will look at such an approach in Section 5. Before going there, the next section offers an alternative approach with less computational effort than zero-shot classification.

4. Unsupervised Classification Using Similarity

This approach is similar to the previous one. It is also suitable in situations with no or only few labeled data.

The model is presented with a text sequence and a list of expressions, and selects the expression which is most "similar" to the text sequence. Here, we use cosine-similarity, which is defined as the dot product of two embedding vectors, each normalized to unit length.
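A minimal numpy illustration of this definition:

```python
import numpy as np

def cosine_similarity(u, v):
    # Normalize each vector to unit length, then take the dot product.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(u @ v)
```

Parallel vectors score 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.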

4.1 Demonstration of the approach

In this section you will learn how to perform unsupervised classification using similarity.

Again, we will try to predict the peril type from the claim descriptions.

First, we create a dictionary that maps certain verbal expressions to peril types:

As you can see, we have applied some of the lessons learned from the previous experiments with the zero-shot classifier. For instance, we have added "Glass" to the list of candidate expressions mapped to "Vandalism".

We use the model sentence-transformers/all-MiniLM-L12-v2, which is a BERT model that produces a sequence of real-valued vectors of length 384. During its pre-training on sentence similarity tasks, mean pooling was applied to convert this sequence into a single vector.

Next, we generate the sentence embeddings of the claim descriptions. To this end, we tokenize them and then pass them into the helper function extract_sequence_encoding, which applies the transformer encoder and extracts the last hidden state. Mean pooling is applied, in line with the way the model was pre-trained. The value True is passed for the keyword argument normalize to enforce normalization of the output vectors to unit length in the Euclidean norm.

Now, the Numpy array x contains the sequence embeddings for each claim description.

The same procedure is applied to the candidate expressions, which are stored in the array y.

Finally, we calculate the pairwise cosine similarity scores by the dot product of the two arrays, and greedily select the peril type with the highest score.
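This step can be sketched with numpy as follows, with random unit vectors standing in for the actual sentence embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the unit-normalized embeddings: x holds one row per claim
# description, y one row per candidate expression.
x = rng.normal(size=(5, 8))
y = rng.normal(size=(3, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y /= np.linalg.norm(y, axis=1, keepdims=True)

# Pairwise cosine similarities: entry (i, j) compares description i
# with candidate expression j.
scores = x @ y.T          # shape (5, 3)

# Greedily select the best-matching candidate expression per description.
best = scores.argmax(axis=1)
```

The index in `best` is then translated back to a peril type via the dictionary of candidate expressions.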

For inspection, we generate a Pandas DataFrame and export it to an Excel file.

The performance is as follows:

This is not bad! We have already improved on the accuracy score obtained by the zero-shot classifier (using a different set of candidate expressions though).

Let’s see how we can improve the results further.

4.2. Refinement

A possible way to improve the performance is to train a classifier using the predicted labels of the previous section. Although this is a supervised learning step, we are not using the original labels, therefore the overall approach is still unsupervised.

First, we predict the labels on the training set, the same way as before, and store them in the DataFrame df_train_copy:

From here, we follow the approach of Section 2. Again, we use distilbert-base-uncased.

First, we tokenize the claim descriptions:

Then, we perform one epoch of training…

… and evaluate the performance on the test set:

The accuracy score has improved by about 2 percentage points.

Compared to the results obtained by zero-shot classification, we observe that the confusion between “Vandalism” and “Vehicle” has been strongly reduced. This might be at least partially due to the fact that we have used different candidate expressions.

For a fair comparison, you might want to go back and re-run the zero-shot classification using the new candidate expressions. However, you will have noticed that the sentence similarity approach is much faster to execute. The computational effort for both approaches is dominated by running the respective transformer model. For the zero-shot classification, the model is run behind the scenes for each combination of sample and candidate expression, so that the effort scales with the number of samples times the number of candidate expressions. In contrast, as we have seen above, the similarity approach runs the transformer model once for each input sample and once for each candidate expression, so that the effort scales with the number of samples plus the number of candidate expressions. This allows experimenting with different candidate expressions.

5. Unsupervised Topic Modeling by Clustering of Document Embeddings

In the previous section we have seen the strength of zero-shot classification: No prior training of the language model is required to produce a classification of reasonable quality. However, it may be difficult to provide suitable candidate expressions.

In this section, we present an alternative approach.

The idea is to encode all text samples, to create clusters of "similar" documents and to extract meaningful verbal representations of the clusters.

Several packages are available to perform this task, e.g., BERTopic, Top2Vec and chat-intents. These packages use similar concepts but provide different APIs, hyper-parameters, diagnostics tools, etc.

Here, we use BERTopic.

The algorithm consists of the following steps:

  1. Embed documents:

    • Encode each text sample (document) into a vector - the embedding. This can be based on a BERT model or any other document embedding technique. By default, BERTopic uses all-MiniLM-L6-v2, which was trained on English text. In the multi-lingual case it uses paraphrase-multilingual-MiniLM-L12-v2.

  2. Cluster documents:

    • Reduce the dimensionality of the embeddings. This is required because the document embeddings are high-dimensional, and typically, clustering algorithms have difficulty clustering data in high-dimensional space. By default, BERTopic uses UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) as it preserves both the local and global structure of embeddings quite well.

    • Create clusters of semantically similar documents. By default, BERTopic uses HDBSCAN as it can identify outliers.

  3. Create topic representation:

    • Extract and reduce topics with c-TF-IDF. This is a modification of TF-IDF, which applies TF-IDF to the concatenation of all documents within each document cluster, to obtain importance scores for the words within the cluster.

    • Improve coherence and diversity of words with Maximal Marginal Relevance, to find the most coherent words without having too much overlap between the words themselves. This results in the removal of words that do not contribute to a topic.

Let's apply the algorithm to our dataset and examine the results.

5.1. Basic topic modeling

Normally, BERTopic instantiates UMAP and HDBSCAN automatically. Here, we instantiate them manually and pass them to BERTopic, for the following reasons:

Otherwise, we use the default parameters used by BERTopic.

The first output of fit_transform holds the topic ID for each sample. The second output is the probability of the sample belonging to that topic.

In our case, we have obtained ca. 50 clusters. Due to the randomness of UMAP, the results may differ between runs. Unfortunately, we have not found a way to fix this.

The cluster with ID -1 contains all samples which are considered "noise" because they were not attributed to any cluster.

The function get_topic_info returns the topic ID, the sample count, and a concatenation of the words representing the cluster.

To get a visual impression of the clusters, BERTopic provides the function visualize_topics which embeds the c-TF-IDF representation of the topics in 2D using UMAP and then visualizes the two dimensions using plotly in an interactive view.

We can visualize the selected terms for a few topics by creating bar charts from the c-TF-IDF scores of each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics, and topic representations can easily be compared to each other. To create these bar charts, simply call the function visualize_barchart:

BERTopic creates topics in a hierarchical structure. The function visualize_hierarchy displays the hierarchy. This information is useful to reduce the number of topics, either by specifying a value for the parameter nr_topics upon instantiation of BERTopic, or after the training by calling the function reduce_topics.

Next, we want to assign labels to each cluster. Compared to manually labeling thousands of samples, this task is much less burdensome!

This is usually a manual task. Assignment of labels is guided by the topic information, the topic word scores and the hierarchical clustering.

In our case, the actual labels are available, so that we can use this information to perform the labeling.

Let's inspect how well the clusters match the labels. The graph below shows one column per topic. The shading indicates the distribution of labels within a given topic. The presence of a single dark patch in a column indicates that almost all of the samples of the topic are associated with a single label.

Obviously, the topic -1, which represents the outliers, has nonzero frequency for many classes. Further, the classes 6 (WaterNW) and 7 (WaterW) seem to be difficult to tell apart from the clusters; this affects some of the topics. For most other topics, the clustering aligns quite well with the labels.

Overall, it appears reasonable to map each topic to the label with the highest frequency. Apart from the exceptions mentioned above, this aligns with a mapping that a human would define manually, in absence of the actual labels.

Therefore, let's define the mapping from topics to labels by picking the label with the highest frequency. The table below shows the topic info, enriched with the label counts and the mapping.
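The mapping can be derived with a cross-tabulation, sketched here on toy topic assignments:

```python
import pandas as pd

# Toy stand-in: topic assignments and true labels for a few training samples
# (topic -1 collects the outliers).
df = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, -1],
    "label": ["Fire", "Fire", "Misc", "Vehicle", "Vehicle", "Misc"],
})

# Cross-tabulate label frequencies per topic...
counts = pd.crosstab(df["topic"], df["label"])

# ...and map each topic to its most frequent label.
topic2label = counts.idxmax(axis=1).to_dict()
```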

Now, let's apply this model to the validation set. First, we assign each sample to a cluster, based on the clustering model.

Then, we apply the mapping from topics to labels, which we have defined above based on the training set. The table below shows for each topic the frequency by label, and the mapping.

This classifier achieves an accuracy score of ca. 70%, compared to 30% obtained with the dummy classifier.

BERTopic provides the function find_topics which returns a list of IDs and similarity scores of topics that best match a given search term.

This is useful to validate the mapping. Let's use the search term "Fire" and retrieve the three most similar topics. For each of these topics, we print the similarity score and the label it was mapped to. We also show the word scores for each topic.

As expected, the topics which have been mapped to "Fire" appear first in the list, with similarity scores of more than 80%.

The first topic that was not mapped to "Fire" has a similarity score of less than 70%. It was mapped to the label "Vehicle". Indeed, although the word "fire" ranks second in the word scores, it appears in combination with "hydrant": this topic is about vehicles hitting fire hydrants.

5.2. Refinement

Above, a relatively large number of samples was classified as outliers. All outliers were mapped to a single class, but this mapping is questionable, because we have seen that outlier samples belong to different classes.

To mitigate this issue, we could label the outlier samples manually. However, this is quite tedious.

Alternatively, we can train a classifier on the labels obtained from the unsupervised approach. To avoid label noise, we exclude the outlier samples.

First, we create the training dataset. We replace the true labels by the labels obtained from the clustering approach.

Then, we evaluate the classifier on the test set, by comparing the predicted to the true labels.

The accuracy score has improved significantly.

6. Conclusions

Congratulations!

In this Part II of the tutorial, you have first applied the techniques you have learned in Part I to a dataset with shorter texts.

Then you have learned how to use zero-shot classification in a situation with no labels. The beauty of this approach is that it requires no training and produces a reasonable classification based on a list of user-defined expressions.

You have also seen that unsupervised classification can be achieved by similarity scoring between the input sequence and a list of user-defined expressions.

Going one step further, you have seen an approach that creates clusters of similar documents and represents each cluster by typical words. This can be used as a starting point to create meaningful labels.

If you have enjoyed this tutorial, feel free to apply any of the approaches - or improved versions, of course - to your own text data, to enrich your structured features available for supervised learning tasks.